Introducing Multiple ModelCheckpoint Callbacks


When training a model, there is always a chance that something will fail unexpectedly. Proper checkpointing provides a safety net: if training crashes, the state of the model and trainer can be restored from a checkpoint file. In Lightning, checkpointing is a core Trainer feature and is turned on by default, creating a checkpoint after every epoch. But checkpointing offers more than crash recovery. Often we also want to keep track of the "best" model weights encountered over the course of training, because in practice not every new epoch improves generalization error (due to unstable optimization or overfitting).
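In Lightning itself, this "best model" tracking is what `ModelCheckpoint`'s `monitor`, `mode`, and `save_top_k` parameters configure. As a rough, framework-agnostic sketch of the underlying idea (this is not Lightning's implementation; the class and field names below are hypothetical), a best-checkpoint tracker only snapshots the weights when the monitored metric improves:

```python
import math

class BestCheckpointTracker:
    """Minimal sketch of 'keep the best weights' checkpointing,
    analogous to save_top_k=1 with a monitored metric.
    Hypothetical illustration, not Lightning's actual code."""

    def __init__(self, mode="min"):
        # mode="min" for losses, mode="max" for accuracies, etc.
        self.mode = mode
        self.best_score = math.inf if mode == "min" else -math.inf
        self.best_state = None
        self.best_epoch = None

    def update(self, epoch, score, state):
        """Record a snapshot of `state` if `score` improves on the best so far."""
        if self.mode == "min":
            improved = score < self.best_score
        else:
            improved = score > self.best_score
        if improved:
            self.best_score = score
            self.best_state = dict(state)  # snapshot the weights
            self.best_epoch = epoch
        return improved

# Usage: validation loss dips at epoch 2, then overfitting sets in,
# so the tracker keeps the epoch-2 weights rather than the final ones.
tracker = BestCheckpointTracker(mode="min")
for epoch, val_loss in enumerate([0.9, 0.5, 0.3, 0.4, 0.6]):
    tracker.update(epoch, val_loss, state={"weights": epoch})

print(tracker.best_epoch, tracker.best_score)  # → 2 0.3
```

The same mechanism generalizes to keeping the top-k checkpoints by maintaining a small sorted list of (score, state) pairs instead of a single best entry.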